Soc 723

Spring 2024

Stephen Vaisey

Theoretical background

Asking causal questions

  • Does more education cause higher wages?
  • Does participating in a job training program cause a higher probability of employment?
  • Do boycotts cause a drop in a company’s share price?
  • These are tough questions!

Threats to causal inference

Key term: Identification

How do we identify the effect of a treatment (cause) on an outcome?

Experimental independent variables

  • What is an experiment?
  • How do experiments solve the problem we just talked about?

Experiments are great!

Assuming successful randomization to treatment and control, you know it’s the treatment that’s causing the effect.

Experiments can’t do everything

  • ethics
  • external validity
  • often non-representative
  • some treatments are hard or impossible to assign randomly
    • motherhood
    • divorce
    • boycotts

Why experiments work

Some notation

\(T\) a binary treatment variable

\(Y\) the value of the outcome we observe

\(Y^0\) the value the outcome would take if \(T=0\)

\(Y^1\) the value the outcome would take if \(T=1\)

Let’s think about the last two a bit more carefully…

The world before the experiment

Subject Y0 Y1 T Y
Andrew 2 3 NA NA
Barb 3 4 NA NA
Catherine 3 4 NA NA
David 2 3 NA NA

What do these numbers mean?

The world after the experiment

Subject Y0 Y1 T Y
Andrew NA 3 1 3
Barb 3 NA 0 3
Catherine NA 4 1 4
David 2 NA 0 2

\[ Y = TY^1+(1-T)Y^0 \]
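The switching equation can be checked directly in R. A quick sketch using the four hypothetical subjects from the tables above:

```r
# Potential outcomes for Andrew, Barb, Catherine, David (from the tables above)
Y0 <- c(2, 3, 3, 2)
Y1 <- c(3, 4, 4, 3)
T  <- c(1, 0, 1, 0)   # treatment assignment after randomization

# Switching equation: we observe Y1 under treatment, Y0 under control
Y <- T * Y1 + (1 - T) * Y0
Y   # 3 3 4 2, matching the observed column in the post-experiment table
```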

Potential outcomes and counterfactuals

  • \(Y = Y^1\) for \(T = 1\)
  • \(Y = Y^0\) for \(T = 0\)
  • We can’t know \(Y^1\) for those who are \(T=0\)
  • We can’t know \(Y^0\) for those who are \(T=1\)
  • This is the fundamental problem of causal inference.

Potential outcomes and counterfactuals

  • \(Y^0\) and \(Y^1\) are potential outcomes.
  • In the real world, \(T\) is either 1 or 0 for each case.
  • We see \(Y^1\) or \(Y^0\), but never both.
  • When \(T=0\), \(Y^1\) is counterfactual
  • When \(T=1\), \(Y^0\) is counterfactual

What do we want to know?

We really care about the difference between \(Y^0\) and \(Y^1\). (Why?)

Let \(\delta_i = y^1_i - y^0_i\)

\(E[\delta]=E[Y^1-Y^0]\)

\(E[\delta]=E[Y^1]-E[Y^0]\)

This is the definition of a treatment effect.

Assume an experiment: what is \(E[\delta]\)?

Subject Y0 Y1 T Y
Andrew 2 3 NA NA
Barb 3 4 NA NA
Catherine 3 4 NA NA
David 2 3 NA NA

If we could see this (invisible) world, what would be our calculation of the treatment effect?

Why can’t we make these calculations in real life?

Assume an experiment: what is \(E[\delta]\)?

Subject Y0 Y1 T Y
Andrew NA 3 1 3
Barb 3 NA 0 3
Catherine NA 4 1 4
David 2 NA 0 2

Does this give us the right answer? Why?
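With randomization, the treated-minus-control difference in observed outcomes recovers the (invisible) true effect. A sketch with the four subjects, using the numbers from the tables above:

```r
Y0 <- c(2, 3, 3, 2)   # Andrew, Barb, Catherine, David
Y1 <- c(3, 4, 4, 3)
T  <- c(1, 0, 1, 0)
Y  <- T * Y1 + (1 - T) * Y0

mean(Y1 - Y0)                        # true SATE: 1
mean(Y[T == 1]) - mean(Y[T == 0])    # experimental estimate: 3.5 - 2.5 = 1
```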

Why do experiments work?

\(T \bot Y^0\)

\(T \bot Y^1\)

\(E[Y^0 | T = 0] = E[Y^0 | T = 1 ]\)

\(E[Y^1 | T = 0] = E[Y^1 | T = 1 ]\)

Or in other words…

In a properly executed experiment, there is no association between the potential outcome variables and treatment assignment.

\(E[Y^0 | T = 0] \simeq E[Y^0]\)

\(E[Y^1 | T = 1] \simeq E[Y^1]\)

So…

\(E[\delta] = E[Y|T=1]-E[Y|T=0]\)

The difference between the treatment average and the control average

What is this treatment effect?

\(E[\delta]\) is the expected value (mean) of the difference between each unit’s value of \(Y^1\) and \(Y^0\). It is the average treatment effect (ATE). In a sample, this is the sample average treatment effect (SATE).

Even though the individual differences are unobservable (because either \(Y^0\) or \(Y^1\) will be counterfactual for each unit), we can estimate the mean difference via experiment.

\[\text{SATE} = \frac{1}{n}\sum_{i=1}^{n}(y^1_i - y^0_i)\]

Randomization

  • Experiments identify the SATE because cases are randomly assigned to the treatment and control groups and are therefore identical, on average, on all pre-treatment characteristics.
  • Experiments are sometimes called randomized controlled trials (or RCTs)

Bias in observational data

  1. Treated and control cases might be different from each other even in the same treatment state (baseline bias)
  2. Treated and control cases might respond differently to treatment (treatment effect heterogeneity)

Key term: baseline bias

Subject Y0 Y1 T Y
Andrew 4000 4000 NA NA
Barb 2000 2000 NA NA
Catherine 3000 3000 NA NA
David 3000 3000 NA NA

Baseline bias

Subject Y0 Y1 T Y
Andrew NA 4000 1 4000
Barb 2000 NA 0 2000
Catherine NA 3000 1 3000
David 3000 NA 0 3000

Here, the people who go to college have different baseline earnings that have nothing to do with going to college.

The “naive estimator”

Subject Y0 Y1 T Y
Andrew NA 4000 1 4000
Barb 2000 NA 0 2000
Catherine NA 3000 1 3000
David 3000 NA 0 3000

\[\hat{\delta}_{naive} = E[Y | T = 1] - E[Y | T = 0]\]

\[\hat{\delta}_{naive} = 3500 - 2500 = 1000\]

Is this a good estimate of the SATE? Why or why not?
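A quick sketch of why the naive estimator fails here: in this table the true effect is zero for everyone, yet the naive difference is 1000.

```r
# Baseline-bias example: college has no effect on anyone's earnings
Y0 <- c(4000, 2000, 3000, 3000)   # Andrew, Barb, Catherine, David
Y1 <- c(4000, 2000, 3000, 3000)
T  <- c(1, 0, 1, 0)
Y  <- T * Y1 + (1 - T) * Y0

mean(Y[T == 1]) - mean(Y[T == 0])   # naive estimate: 3500 - 2500 = 1000
mean(Y1 - Y0)                       # true SATE: 0
```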

Key term: treatment heterogeneity

The treatment may not have a single effect, but may have different effects for different groups in the population. If the treatment and control groups (would) respond differently to treatment, this can bias the estimate of the effect.

Example 1

Subject Y0 Y1 T Y
Andrew 2000 4000 NA NA
Barb 2000 2000 NA NA
Catherine 2000 4000 NA NA
David 2000 2000 NA NA

What is the effect of a college degree here?

Example 1

Subject Y0 Y1 T Y
Andrew NA 4000 1 4000
Barb 2000 NA 0 2000
Catherine NA 4000 1 4000
David 2000 NA 0 2000

How might we calculate the effect of a college degree here?

Does this give the right answer? Why or why not?

Example 2

Subject Y0 Y1 T Y
Andrew 3000 4000 NA NA
Barb 2000 3500 NA NA
Catherine 3000 4000 NA NA
David 2000 3500 NA NA

What is the effect of a college degree here?

Example 2

Subject Y0 Y1 T Y
Andrew NA 4000 1 4000
Barb 2000 NA 0 2000
Catherine NA 4000 1 4000
David 2000 NA 0 2000

How might we calculate the effect of a college degree on earnings here?

Does this give the right answer? Why or why not?
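Computing the relevant quantities for Example 2 (a sketch; values from the tables above) shows why the naive difference matches neither the SATE nor the SATT:

```r
Y0 <- c(3000, 2000, 3000, 2000)   # Andrew, Barb, Catherine, David
Y1 <- c(4000, 3500, 4000, 3500)
T  <- c(1, 0, 1, 0)
Y  <- T * Y1 + (1 - T) * Y0

mean(Y[T == 1]) - mean(Y[T == 0])   # naive: 4000 - 2000 = 2000
mean(Y1 - Y0)                       # SATE: 1250
mean((Y1 - Y0)[T == 1])             # SATT: 1000
```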

Three basic types of treatment effects

  • Average treatment effect (ATE)
  • Average treatment effect on the treated (ATT or ATET)
  • Average treatment effect on the controls (or untreated) (ATC or ATU)

What is the difference?

  • ATE is \(E(Y^1 - Y^0)\) for all units (effect of switching)

  • ATT is \(E(Y^1 - Y^0)\) for treated units (effect of taking away treatment)

  • ATC is \(E(Y^1 - Y^0)\) for untreated units (effect of adding treatment)

Exercise: calculating treatment effects

Group \(E(Y^1)\) \(E(Y^0)\)
College degree 1000 600
No degree 800 500

If 30% of the population has a degree…

  • What is the naive estimate?
  • What are the ATE, ATT, and ATC?

What is your estimand?

Directed Acyclic Graphs

Intro to DAGs for causal systems

  • Y: outcome

  • T: treatment

  • U: unobserved confounder

  • S: affects selection into T

  • X: affects Y directly

Regression vs. matching/weighting

Regression attempts to identify \(T \rightarrow Y\) by adjusting for \(X\) while regressing \(Y\) on \(T\)

Matching and weighting attempt to identify \(T \rightarrow Y\) by ensuring that \(S\) has the same distribution for all values of \(T\)

Both are strategies to close the backdoor path between \(T\) and \(Y\)

Caveat: neither works here

These techniques only allow us to account for observed differences between treated and control cases. If \(U\) is unobserved, we can’t close the backdoor path with either of these approaches.

Exact Matching

The logic of exact matching

  • We want to compare “apples to apples”
  • In an experiment, we compare groups who differ only in treatment status
  • We want to simulate that by comparing groups that are similar in all (known) respects except the treatment

Simulation

library(tidyverse)

set.seed(1234)

# create the data 

obs <- 1e6   # 1M observations to minimize randomness

d <- tibble(
  U = rbinom(obs, 1, .5) ,                          # unobserved difference
  S = rbinom(obs, 1, .25 + .5*U) ,                  # U -> S
  T = rbinom(obs, 1, .25 + .5*S) ,                  # S -> T
  Y0 = round(2000 + 2000*U + rnorm(obs, 0, 500)) ,  # untreated outcome
  Y1 = round(3000 + 2000*U + rnorm(obs, 0, 500)) ,  # treated outcome
  Y = T*Y1 + (1-T)*Y0) |>                           # observed Y 
  select(-U, -(Y0:Y1))                              # keep observed variables

Naive estimate

d |> group_by(T) |> summarize(mY = round(mean(Y)))
# A tibble: 2 × 2
      T    mY
  <int> <dbl>
1     0  2749
2     1  4249
lm( Y ~ T , data = d )

Call:
lm(formula = Y ~ T, data = d)

Coefficients:
(Intercept)            T  
       2749         1500  

Subclassifying on S

S T mean_y
0 0 2499
0 1 3498
1 0 3499
1 1 4500

What is the difference within levels of S? Does this give us the right answer? Why?
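The cell means above can be reproduced with a compact base-R version of the same simulation (a sketch with a smaller n, so the numbers will wobble slightly around the table above):

```r
set.seed(1234)
obs <- 1e5
U  <- rbinom(obs, 1, .5)               # unobserved difference
S  <- rbinom(obs, 1, .25 + .5 * U)     # U -> S
Tr <- rbinom(obs, 1, .25 + .5 * S)     # S -> T
Y  <- ifelse(Tr == 1,
             3000 + 2000 * U + rnorm(obs, 0, 500),   # treated outcome
             2000 + 2000 * U + rnorm(obs, 0, 500))   # untreated outcome

# Cell means by S and treatment status: within each level of S,
# the treated-control gap is about 1000, the true effect
cell_means <- round(tapply(Y, list(S = S, T = Tr), mean))
cell_means
```

Within levels of S, treatment is independent of U (T depends only on S), so the within-stratum comparison recovers the effect even though U is never observed.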

Review: experimental assumptions

In an experiment, the treatment and control groups are otherwise the same.

Assumption 1: \(E(Y^1|T=1) = E(Y^1|T=0)\)

Assumption 2: \(E(Y^0|T=1) = E(Y^0|T=0)\)

Conditional Independence Assumption (CIA)

There exist some observable variables, \(S\), which completely account for the differences between the treatment and control groups.

Assumption 1-S: \(E(Y^1|T=1,S) = E(Y^1|T=0,S)\)

Assumption 2-S: \(E(Y^0|T=1,S) = E(Y^0|T=0,S)\)

Exact matching

  • If we match cases exactly on all observed characteristics, the treatment is necessarily independent of all those characteristics within groups.
  • Unfortunately, there is no way to test the conditional independence assumption with the potential outcome variables because we can’t observe both \(Y^0\) and \(Y^1\). There might always be something else out there that opens a backdoor path.

What should I use for S variables?

  • We think so much about omitted variable bias that we don’t often consider the risk of overcontrolling
  • Using any approach, there is a risk of removing some of the true effect of \(T\) by controlling for (or conditioning on) post-treatment variables

Don’t condition on post-treatment variables

If we control for A and B here, we’re not unconfounding the relationship between T and Y. Rather, we’re controlling away part of the true effect of T! This is why we shouldn’t control for (or stratify on) post-treatment variables.
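A tiny simulation makes the point (a sketch with made-up numbers: here A is a mediator on the path T → A → Y, and the true total effect of T on Y is 1):

```r
set.seed(1)
n  <- 1e5
Tr <- rbinom(n, 1, .5)
A  <- 0.5 * Tr + rnorm(n)    # post-treatment variable: T -> A
Y  <- 2 * A + rnorm(n)       # A -> Y, so the total effect of T is 0.5 * 2 = 1

b_total   <- coef(lm(Y ~ Tr))["Tr"]      # about 1: the correct total effect
b_control <- coef(lm(Y ~ Tr + A))["Tr"]  # about 0: conditioning on the
                                         # post-treatment A removes the effect
```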

Three assumptions

  1. Selection on observables (CIA)
  2. Overlap (any individual case has a non-zero probability of treatment)
  3. Stable unit treatment value assumption (SUTVA; among other things, each case’s outcome depends only on its own treatment status, not on whether other cases are treated)

TE heterogeneity

SES n degree earnings
1 150 0 2000
1 50 1 4000
2 100 0 6000
2 100 1 8000
3 50 0 10000
3 150 1 14000

What are the ATE and ATT?

SATE = $2667

SATT = $3000
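The stratum-weighting arithmetic behind these answers, as a sketch using the table’s counts and within-stratum effects:

```r
n_t <- c(50, 100, 150)                  # treated n by SES stratum
n_c <- c(150, 100, 50)                  # control n by SES stratum
eff <- c(4000 - 2000, 8000 - 6000, 14000 - 10000)  # within-stratum effects

sum((n_t + n_c) * eff) / sum(n_t + n_c)   # SATE: about 2667
sum(n_t * eff) / sum(n_t)                 # SATT: 3000
```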

Recap of procedure

  1. Take the differences between treated and untreated groups within each stratum of \(S\)
  2. Weight these differences by the right distribution for the estimand of interest:
  • For the ATT, weight differences by the distribution of \(S\) for treated cases
  • For the ATE, weight the differences by the total sample distribution of \(S\)

What about this?

SES n degree earnings
1 150 0 2000
1 0 1 NA
2 100 0 6000
2 100 1 8000
3 50 0 10000
3 150 1 14000

What is the ATE here? ATT?

Why ATT is the most common estimand

  1. As the last example illustrates, sometimes we can only estimate one treatment effect of interest (if that!).
  2. Since treated cases are usually less frequent than control cases, ATT is often easier to get and is the default in many routines.
  3. We will often focus on the ATT in the interest of time but the ATE (and even ATU) are important in their own right.

Key term: common support

Strata that have only treatment or control cases (not both) are called off support. Strata with both treatment and control cases are in the region of common support.
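Applied to the previous table (a sketch): SES stratum 1 has no treated cases, so it is off support, and a feasible SATT can use only the strata in the region of common support.

```r
n_t <- c(0, 100, 150)     # treated n by SES stratum (stratum 1 off support)
n_c <- c(150, 100, 50)
eff <- c(NA, 8000 - 6000, 14000 - 10000)  # no within-stratum effect in stratum 1

on_support <- n_t > 0 & n_c > 0
on_support                # FALSE TRUE TRUE

# Feasible SATT: weight by treated n within the region of common support
sum((n_t * eff)[on_support]) / sum(n_t[on_support])   # 3200
```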

Focal example

What is the effect of maternal smoking on infant health?

Data are from a subsample (N = 4642) of singleton births in PA between 1989 and 1991. See Almond et al. 2005. “The Costs of Low Birth Weight.”

Bring in the data

library(tidyverse)
library(here)

d <- haven::read_dta(here("data", "cattaneo2.dta"))
d <- d |>  
  haven::zap_labels() |>             # remove Stata labels
  mutate( smoker = factor(mbsmoke, 
                          labels = c("Nonsmoker", "Smoker")) ,
          zweight = (bweight - mean(bweight)) / sd(bweight)) |>
  select( bweight, zweight, lbweight, mbsmoke, smoker, mmarried, 
          mage, medu, fbaby, alcohol, mrace, nprenatal )

Summary statistics

d |>  
  select(where(is.numeric)) |> 
  psych::describe(fast = TRUE)
          vars    n    mean     sd    min     max   range   se
bweight      1 4642 3361.68 578.82 340.00 5500.00 5160.00 8.50
zweight      2 4642    0.00   1.00  -5.22    3.69    8.91 0.01
lbweight     3 4642    0.06   0.24   0.00    1.00    1.00 0.00
mbsmoke      4 4642    0.19   0.39   0.00    1.00    1.00 0.01
mmarried     5 4642    0.70   0.46   0.00    1.00    1.00 0.01
mage         6 4642   26.50   5.62  13.00   45.00   32.00 0.08
medu         7 4642   12.69   2.52   0.00   17.00   17.00 0.04
fbaby        8 4642    0.44   0.50   0.00    1.00    1.00 0.01
alcohol      9 4642    0.03   0.18   0.00    1.00    1.00 0.00
mrace       10 4642    0.84   0.37   0.00    1.00    1.00 0.01
nprenatal   11 4642   10.76   3.68   0.00   40.00   40.00 0.05

Smoker/non-smoker differences

T-test of the difference

estimate conf.low conf.high
0.476 0.404 0.547

The t-test gives us the naive estimate of the effect of smoking. If we assume the only difference between smokers and non-smokers is smoking, the effect of smoking is to reduce birthweight by .476 standard deviations (276g).

Density plot

Introduction to MatchIt

  • Implements many different forms of matching
  • Many different options to cover (later)
  • Basic syntax:
match_object <- matchit(formula,
                        data = df)

Exact matching by dummy variables

# match on 3 dummy variables only
ematch_out <- 
  matchit(mbsmoke ~ mmarried + alcohol + mrace , 
          data = d,
          method = "exact")

# confirm all are matched
ematch_out

Note: later we will define formulas so we don’t have to re-type all the variables every time.

Exact matching by dummy variables

A matchit object
 - method: Exact matching
 - number of obs.: 4642 (original), 4642 (matched)
 - target estimand: ATT
 - covariates: mmarried, alcohol, mrace
               Control Treated
All (ESS)     3778.000     864
All           3778.000     864
Matched (ESS) 2081.684     864
Matched       3778.000     864
Unmatched        0.000       0
Discarded        0.000       0

Get the matched data

# get the matched data
ematch_data <- match.data(ematch_out)

# glimpse data
glimpse(ematch_data)

The matched data

Rows: 4,642
Columns: 14
$ bweight   <dbl> 3459, 3260, 3572, 2948, 2410, 3147, 3799, 3629, 2835, 3880, …
$ zweight   <dbl> 0.16813549, -0.17566764, 0.36336038, -0.71469567, -1.6441734…
$ lbweight  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ mbsmoke   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ smoker    <fct> Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsm…
$ mmarried  <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ mage      <dbl> 24, 20, 22, 26, 20, 27, 27, 24, 21, 30, 26, 20, 34, 21, 23, …
$ medu      <dbl> 14, 10, 9, 12, 12, 12, 12, 12, 12, 15, 12, 12, 14, 8, 12, 12…
$ fbaby     <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, …
$ alcohol   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mrace     <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ nprenatal <dbl> 10, 6, 10, 10, 12, 9, 16, 11, 20, 9, 14, 5, 13, 8, 4, 10, 13…
$ weights   <dbl> 0.6057834, 1.3247009, 0.6057834, 0.6057834, 0.6057834, 2.360…
$ subclass  <fct> 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, …

Subclassification

subclass_table <- ematch_data |> 
  group_by(mmarried, alcohol, mrace) |> 
  summarize(
    n_t = sum(mbsmoke),                            # Ntreat
    n_c = sum(1-mbsmoke),                          # Ncon
    zbw_t = weighted.mean(zweight, w = mbsmoke),   # mean std bw for treated
    zbw_c = weighted.mean(zweight, w = 1-mbsmoke), # mean std bw for control
    row_diff = zbw_t - zbw_c,                      # mean treat-control diff
    wt_t = weighted.mean( weights, w = mbsmoke),   # mean treat weight
    wt_c = weighted.mean( weights, w = 1-mbsmoke)) # mean control weight

Subclassification table

# A tibble: 8 × 10
# Groups:   mmarried, alcohol [4]
  mmarried alcohol mrace   n_t   n_c  zbw_t   zbw_c row_diff  wt_t  wt_c
     <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>    <dbl> <dbl> <dbl>
1        0       0     0   113   373 -0.592 -0.493   -0.0996     1 1.32 
2        0       0     1   291   539 -0.391  0.0506  -0.442      1 2.36 
3        0       1     0    29    14 -0.970 -0.253   -0.717      1 9.06 
4        0       1     1    22    13 -0.379 -0.514    0.135      1 7.40 
5        1       0     0    19   182 -0.775 -0.262   -0.512      1 0.456
6        1       0     1   362  2613 -0.252  0.205   -0.457      1 0.606
7        1       1     0     4     6 -0.824 -0.0936  -0.731      1 2.92 
8        1       1     1    24    38 -0.328  0.383   -0.711      1 2.76 

What are the weights doing?

Unweighted

Weighted

Manually calculating ATT/ATE

# manually calculate ATT
with(subclass_table, 
     sum(n_t*row_diff) / sum(n_t) )
[1] -0.4082609
# manually calculate ATE
with(subclass_table, 
     sum((n_t+n_c)*row_diff) / sum(n_t+n_c) )
[1] -0.4211786

ATT by weighted least squares (WLS)

m_att <- lm(zweight ~ mbsmoke , data = ematch_data ,
            weights = weights)
tidy(m_att)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.0212    0.0164      1.30 1.95e- 1
2 mbsmoke      -0.408     0.0380    -10.8  1.14e-26

Statistical theory of matching

The jury is still out on the statistical theory behind some of these procedures, so in many cases it’s not obvious what the right way to calculate standard errors is.

We will cover bootstrapping later; it is now (once again) seen as a good option for getting standard errors.
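A self-contained sketch of the idea on simulated data (made-up numbers; the key point is that the entire subclassification procedure is repeated inside every resample):

```r
set.seed(42)
n  <- 2000
S  <- rbinom(n, 1, .5)                          # confounder
Tr <- rbinom(n, 1, .25 + .5 * S)                # S -> T
Y  <- 1000 * Tr + 2000 * S + rnorm(n, 0, 500)   # true effect: 1000

# ATT via exact subclassification on S, computed on a resampled index
att <- function(idx) {
  s <- S[idx]; t <- Tr[idx]; y <- Y[idx]
  diffs <- sapply(0:1, function(k)
    mean(y[s == k & t == 1]) - mean(y[s == k & t == 0]))
  wts <- sapply(0:1, function(k) sum(t[s == k]))  # treated n per stratum
  sum(wts * diffs) / sum(wts)
}

boot <- replicate(1000, att(sample(n, replace = TRUE)))
sd(boot)     # bootstrap standard error of the ATT
```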

Lessons from exact matching

Exact matching contains:

Matching: compare treatment/control differences in outcome on cases that are identical on other observed characteristics (i.e., matched)

Weighting: apply weights so that these stratum-specific differences are aggregated in a way that reflects the distribution of interest (e.g., ATT)

Exact matching isn’t often practical

ematch_out2 <- matchit(mbsmoke ~ mmarried + alcohol + mrace + 
                         fbaby + mage + medu + nprenatal , 
                 data = d,
                 method = "exact")
ematch_out2
A matchit object
 - method: Exact matching
 - number of obs.: 4642 (original), 1235 (matched)
 - target estimand: ATT
 - covariates: mmarried, alcohol, mrace, fbaby, mage, medu, nprenatal
summary(ematch_out2)[[2]] # get match summary
                Control Treated
All (ESS)     3778.0000     864
All           3778.0000     864
Matched (ESS)  478.5805     362
Matched        873.0000     362
Unmatched     2905.0000     502
Discarded        0.0000       0

“Feasible” estimates of TEs

  • If we do drop cases because of common support concerns, we signal this by calling our estimands “feasible” (i.e., the best we could do!)
  • If we drop any cases, we can get FSATE but not SATE
  • If we drop treatment cases, we can get FSATT but not SATT

But we don’t want a method that requires us to throw away tons of data if we don’t have to!

Moving beyond exact matching

  1. Propensity-score based approaches (parametric or semi-parametric)
  2. Non-PS-based approaches (non-parametric)
  3. Using parametric models (i.e., regression) on a data set preprocessed by (1) or (2)